
    Simulation of multi-core systems

    The computer systems market has grown significantly since the beginning of the cloud computing era, and this demand drives an increase in the complexity and efficiency of computer architectures. Simulation is one of the most important steps in the development of new architectures: it eliminates the need for real hardware during the initial development phases. In this work, we propose a gem5 simulator model of the ARM Neoverse N1. We calibrate the model's cache memories by running microbenchmarks on the model and comparing the results against the real hardware. The results show that our calibration method achieves cache access latencies close to those of the real hardware.
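
    The abstract does not reproduce the microbenchmark itself; as a point of reference, cache access latency is commonly calibrated with a pointer-chasing loop over randomly permuted memory, swept across working-set sizes. The sketch below illustrates that technique only; the buffer sizes, hop count, and output format are assumptions, not the authors' benchmark.

    ```c
    /* Minimal pointer-chasing latency microbenchmark (illustrative sketch).
     * Walking a single random cycle defeats hardware prefetching, so the
     * average time per dependent load approximates the access latency of
     * whichever cache level the working set fits into. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define HOPS 10000000UL   /* number of dependent loads per working set (assumed) */

    static double chase(size_t n_elems) {
        size_t *next = malloc(n_elems * sizeof *next);
        if (!next) return 0.0;
        for (size_t i = 0; i < n_elems; i++) next[i] = i;
        /* Sattolo's algorithm: produces a single cycle visiting every element. */
        for (size_t i = n_elems - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (unsigned long h = 0; h < HOPS; h++) p = next[p];   /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("working set %8zu KiB: %6.2f ns/load (p=%zu)\n",
               n_elems * sizeof *next / 1024, ns / HOPS, p);
        free(next);
        return ns / HOPS;
    }

    int main(void) {
        /* Sweep working sets from L1-sized up to beyond typical LLC sizes. */
        for (size_t kib = 16; kib <= 16384; kib *= 2)
            chase(kib * 1024 / sizeof(size_t));
        return 0;
    }
    ```

    Running the same sweep inside the simulator and on the silicon exposes the latency plateaus of each cache level, which is the signal such a calibration compares.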

    High-Performance Solvers for Dense Hermitian Eigenproblems

    We introduce a new collection of solvers - subsequently called EleMRRR - for large-scale dense Hermitian eigenproblems. EleMRRR solves various types of problems: generalized, standard, and tridiagonal eigenproblems. Among these, the last is of particular importance, as it is a solver in its own right as well as the computational kernel for the first two; we present a fast and scalable tridiagonal solver based on the Algorithm of Multiple Relatively Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers, PMRRR is part of the freely available Elemental library and is designed to fully support both message-passing (MPI) and multithreaded (SMP) parallelism. As a result, the solvers can be used equally well in pure MPI or in hybrid MPI-SMP fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's solvers on two supercomputers. This study, performed with up to 8,192 cores, provides precise guidelines for assembling the fastest solver within the ScaLAPACK framework; it also indicates that EleMRRR outperforms even the fastest solvers built from ScaLAPACK's components.
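
    PMRRR itself is a distributed-memory library; as a small serial point of reference for the tridiagonal kernel it parallelizes, the MRRR algorithm is available in LAPACK as dstemr. The sketch below calls it through the LAPACKE C interface; the 4x4 tridiagonal matrix is made up for illustration and has nothing to do with the paper's benchmarks.

    ```c
    /* Illustrative sketch: compute all eigenpairs of a small symmetric
     * tridiagonal matrix with LAPACK's MRRR routine dstemr (via LAPACKE). */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 4;
        double d[4] = {  2.0,  2.0,  2.0, 2.0 };   /* diagonal */
        double e[4] = { -1.0, -1.0, -1.0, 0.0 };   /* off-diagonal; last entry is workspace */
        double w[4], z[4 * 4];
        lapack_int m, isuppz[2 * 4];
        lapack_logical tryrac = 1;   /* let MRRR test for high relative accuracy */

        lapack_int info = LAPACKE_dstemr(LAPACK_COL_MAJOR, 'V', 'A', n, d, e,
                                         0.0, 0.0, 0, 0,   /* vl,vu,il,iu unused for range 'A' */
                                         &m, w, z, n, n, isuppz, &tryrac);
        if (info != 0) { fprintf(stderr, "dstemr failed: %d\n", (int)info); return 1; }
        for (lapack_int i = 0; i < m; i++) printf("lambda[%d] = %f\n", (int)i, w[i]);
        return 0;
    }
    ```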

    Optimal load balancing techniques for block-cyclic decompositions for matrix factorization

    In this paper, we present a new load balancing technique, called panel scattering, which is generally applicable to parallel block-partitioned dense linear algebra algorithms such as matrix factorization. Here, the panels formed in such computations are divided across their length and evenly (re-)distributed among all processors. It is shown how this technique can be efficiently implemented for the general block-cyclic matrix distribution, requiring only the collective communication primitives that are required for block-cyclic parallel BLAS. In most situations, panel scattering yields optimal load balance and cell computation speed across all stages of the computation. It also has the advantage of naturally yielding good memory access patterns. Compared with traditional methods, which minimize communication costs at the expense of load balance, it incurs a small (in some situations negative) increase in communication volume. It does, however, incur extra communication startup costs, though only by a factor not exceeding 2. To maximize load balance and minimize the cost of panel redistribution, storage block sizes should be kept small; furthermore, in many situations of interest, there will be no significant communication startup penalty for doing so. Results are given for the Fujitsu AP+ parallel computer, comparing the performance of panel scattering with previously established methods for LU, LLT and QR factorization. These results are consistent with a detailed performance model for LU factorization, developed here for each method.
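
    For readers unfamiliar with the distribution being discussed, the sketch below shows the standard 1-D block-cyclic index mapping that such block-partitioned factorizations assume. It is the usual textbook convention, not the paper's code; the block size and process count in the example are arbitrary.

    ```c
    /* Standard 1-D block-cyclic mapping: for a global index g, block size nb,
     * and p processes, determine the owning process and the local position. */
    #include <stdio.h>

    typedef struct { int owner; int local; } bc_map;

    static bc_map block_cyclic(int g, int nb, int p) {
        int gblock = g / nb;                   /* global block index              */
        bc_map m;
        m.owner = gblock % p;                  /* blocks are dealt out cyclically  */
        m.local = (gblock / p) * nb + g % nb;  /* position in the owner's storage  */
        return m;
    }

    int main(void) {
        /* Example: block size 3 on 2 processes: columns 0-2 go to P0,
         * 3-5 to P1, 6-8 back to P0, and so on. */
        for (int g = 0; g < 12; g++) {
            bc_map m = block_cyclic(g, 3, 2);
            printf("global %2d -> P%d local %d\n", g, m.owner, m.local);
        }
        return 0;
    }
    ```

    With this mapping, a whole panel of width nb lives on one process column, which is the load imbalance panel scattering removes by splitting each panel along its length; keeping nb small limits how much data each redistribution has to move.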

    ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers

    Solving the electronic structure from a generalized or standard eigenproblem is often the bottleneck in large-scale calculations based on Kohn-Sham density-functional theory. This problem must be addressed by essentially all current electronic structure codes, based on similar matrix expressions, and by high-performance computation. Here we present a unified software interface, ELSI, to access different strategies that address the Kohn-Sham eigenvalue problem. Currently supported algorithms include the dense generalized eigensolver library ELPA, the orbital minimization method implemented in libOMM, and the pole expansion and selected inversion (PEXSI) approach with lower computational complexity for semilocal density functionals. The ELSI interface aims to simplify the implementation and optimal use of the different strategies by offering (a) a unified software framework designed for the electronic structure solvers in Kohn-Sham density-functional theory; (b) reasonable default parameters for a chosen solver; (c) automatic conversion between input and internal working matrix formats; and, in the future, (d) recommendation of the optimal solver depending on the specific problem. Comparative benchmarks are shown for system sizes up to 11,520 atoms (172,800 basis functions) on distributed-memory supercomputing architectures.
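
    The problem all of these solvers target is the generalized Kohn-Sham eigenproblem H C = S C E, with a Hamiltonian H and an overlap matrix S. As a minimal serial stand-in (this is not the ELSI interface, and the 2x2 matrices are invented for illustration), LAPACK's dsygv solves the dense real symmetric-definite case directly:

    ```c
    /* Minimal serial stand-in for the generalized eigenproblem H C = S C E
     * (real symmetric H, symmetric positive-definite overlap S). */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 2;
        /* Column-major storage; only the upper triangles are referenced. */
        double H[4] = { -1.0, 0.2,
                         0.2, -0.5 };
        double S[4] = {  1.0, 0.1,
                         0.1,  1.0 };
        double eval[2];

        /* itype=1 selects A*x = lambda*B*x; jobz='V' returns eigenvectors in H. */
        lapack_int info = LAPACKE_dsygv(LAPACK_COL_MAJOR, 1, 'V', 'U', n,
                                        H, n, S, n, eval);
        if (info != 0) { fprintf(stderr, "dsygv failed: %d\n", (int)info); return 1; }
        printf("eigenvalues: %f %f\n", eval[0], eval[1]);
        return 0;
    }
    ```

    Libraries such as ELPA, libOMM, and PEXSI exist precisely because a serial dense solve like this does not scale to the matrix sizes quoted in the benchmarks; ELSI's role is to hide the differences between those back ends behind one interface.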

    Studies in Rheology: Molecular Simulation and Theory

    With the enormous advance in the capability of computers during the last few decades, computer simulation has become an important tool for scientific research in many areas such as physics, chemistry, and biology. In particular, molecular dynamics (MD) simulations have proven to be of great help in understanding the rheology of complex fluids from the fundamental microscopic viewpoint. There are two important standard flows in rheology: shear flow and elongational flow. While suitable nonequilibrium MD (NEMD) algorithms exist for shear flow, such as the Lees-Edwards purely boundary-driven algorithm and the field-driven SLLOD algorithm, a proper NEMD algorithm for elongational flow has been lacking. The main difficulty of simulating elongational flow lies in the limited simulation time available, due to the contraction of one or two dimensions dictated by its kinematics. This problem, however, has been partially resolved by Kraynik and Reinelt's ingenious discovery of the temporal and spatial periodicity of lattice vectors in planar elongational flow (PEF). Although there have been a few NEMD simulations of PEF using their idea, another serious defect has recently been reported when using the SLLOD algorithm in PEF: for adiabatic systems, the total linear momentum of the system in the contracting direction grows exponentially with time, which eventually leads to an aphysical phase transition. This problem has been completely resolved by using the so-called 'proper-SLLOD' or 'p-SLLOD' algorithm, whose development has been one of the main accomplishments of this study. The fundamental correctness of the p-SLLOD algorithm has been demonstrated quite thoroughly in this work through detailed theoretical analyses together with direct simulation results. Both the theoretical and simulation work achieved in this research are expected to play a significant role in advancing the knowledge of rheology, as well as that of NEMD simulation itself for other types of flow in general. Another important achievement of this work is the demonstration of the possibility of predicting liquid structure in nonequilibrium states by employing a concept of 'hypothetical' nonequilibrium potentials. The methodology developed in this work has been shown to have good potential for further developments in this field.
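
    For orientation, the equations below give the schematic forms of the SLLOD and p-SLLOD equations of motion as they are commonly written in the NEMD literature; the notation (r_i, p_i, F_i, and the imposed velocity gradient tensor) is assumed here and is not quoted from the thesis itself.

    ```latex
    % Planar elongational flow imposes the velocity gradient
    %   \nabla\mathbf{u} = \dot{\varepsilon}\,(\hat{\mathbf{x}}\hat{\mathbf{x}} - \hat{\mathbf{y}}\hat{\mathbf{y}}),
    % so one direction contracts while the other expands.
    \begin{align}
      \text{SLLOD:}\quad
        \dot{\mathbf{r}}_i &= \frac{\mathbf{p}_i}{m_i} + \mathbf{r}_i\cdot\nabla\mathbf{u}, &
        \dot{\mathbf{p}}_i &= \mathbf{F}_i - \mathbf{p}_i\cdot\nabla\mathbf{u} \\
      \text{p-SLLOD:}\quad
        \dot{\mathbf{r}}_i &= \frac{\mathbf{p}_i}{m_i} + \mathbf{r}_i\cdot\nabla\mathbf{u}, &
        \dot{\mathbf{p}}_i &= \mathbf{F}_i - \mathbf{p}_i\cdot\nabla\mathbf{u}
          - m_i\,\mathbf{r}_i\cdot\nabla\mathbf{u}\cdot\nabla\mathbf{u}
    \end{align}
    ```

    For planar shear the extra quadratic term vanishes, so the two algorithms coincide there; the distinction matters precisely in elongational flow, where the momentum growth described above appears.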

    Accelerated methods for performing the LDLT decomposition

    This paper describes the design, implementation and performance of parallel direct dense symmetric-indefinite matrix factorisation algorithms. These algorithms use the Bunch-Kaufman diagonal pivoting method. The starting point is numerically identical to the LAPACK _sytrf() algorithm, but outperforms zsytrf() by approximately 15% for large matrices on the UltraSPARC family of processors. The first variant reduces symmetric interchanges, which are particularly important for parallel implementation, by taking into account the growth attained by any preceding columns that did not require interchanges; it nevertheless achieves the same growth bound. The second variant uses a lookahead technique with heuristic methods to predict whether interchanges are required over the next block column; if so, the block column can be eliminated using modified Cholesky methods, which can yield both computational and communication advantages. These algorithms yield the best performance gains on `weakly indefinite' matrices (i.e. those which generally have large diagonal elements), which often arise from electromagnetic field analysis applications. On UltraSPARC processors, the first variant generally achieves a 1-2% performance gain; the second is faster still for large matrices, by 2% for complex double precision and 6% for double precision. However, larger performance gains are observed on distributed-memory machines, where symmetric interchanges are relatively more expensive. On a 16-node 300 MHz UltraSPARC-based Fujitsu AP3000, the first variant achieved a 10-15% improvement for small to moderately sized matrices, decreasing to 7% for large matrices. For N=10000, it achieved a sustained speed of 5.6 GFLOPs and a parallel speedup of 12.8.
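
    The baseline these variants build on is LAPACK's sytrf family, which performs the Bunch-Kaufman LDL^T factorisation with 1x1 and 2x2 pivots. The sketch below calls the real double-precision routine through LAPACKE; the 3x3 indefinite matrix is made up for illustration and is unrelated to the paper's test problems.

    ```c
    /* Bunch-Kaufman LDL^T factorisation of a small symmetric indefinite
     * matrix via LAPACK's dsytrf (LAPACKE interface). */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 3;
        /* Symmetric indefinite matrix, column-major; upper triangle referenced. */
        double A[9] = {  4.0,  1.0, -2.0,
                         1.0, -3.0,  0.5,
                        -2.0,  0.5,  1.0 };
        lapack_int ipiv[3];

        /* Factor A = U*D*U^T; ipiv records the 1x1 / 2x2 pivot structure and
         * any symmetric interchanges that were performed. */
        lapack_int info = LAPACKE_dsytrf(LAPACK_COL_MAJOR, 'U', n, A, n, ipiv);
        if (info != 0) { fprintf(stderr, "dsytrf failed: %d\n", (int)info); return 1; }
        for (lapack_int i = 0; i < n; i++)
            printf("ipiv[%d] = %d\n", (int)i, (int)ipiv[i]);
        return 0;
    }
    ```

    The interchanges recorded in ipiv are exactly the operations that become expensive in a distributed setting, which is why the paper's variants try to avoid or predict them.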

    Algorithmic Cholesky factorization fault recovery

    Modeling and analysis of large-scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. With large matrices, this will often be performed on high-performance clusters containing many processors. Assuming a constant failure rate per processor, the probability of a failure occurring during the execution increases linearly with additional processors. Fault tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault tolerant Cholesky factorization algorithm that does not require checkpointing for recovery from fail-stop failures. Rather, this algorithm uses redundant data added in an additional set of processors. This differs from previous work with algorithmic methods, as it addresses fail-stop failures rather than fail-continue cases. The implementation and experimentation using ScaLAPACK demonstrate that this method has decreasing overhead relative to overall runtime as the matrix size increases, and thus shows promise for reducing the expected runtime of Cholesky factorizations on very large matrices.
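
    The redundant data in such algorithm-based schemes is typically a checksum block held on the extra processors. The sketch below shows only the underlying recovery idea on a tiny serial example (it is not the paper's ScaLAPACK implementation, and the matrix values are invented): an extra row of column sums lets a lost row be rebuilt from the surviving rows.

    ```c
    /* Checksum-based recovery idea behind algorithmic fault tolerance:
     * an extra row holds the column sums of A, so a single lost row can be
     * reconstructed as (checksum - sum of surviving rows). */
    #include <stdio.h>
    #include <string.h>

    #define N 4

    int main(void) {
        double A[N + 1][N] = {
            { 4.0, 1.0, 0.5, 0.0 },
            { 1.0, 3.0, 0.0, 0.5 },
            { 0.5, 0.0, 2.0, 1.0 },
            { 0.0, 0.5, 1.0, 5.0 },
            { 0 }                      /* row N will hold the checksum */
        };

        /* Build the checksum row: column sums of the data rows. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                A[N][j] += A[i][j];

        /* Simulate a fail-stop loss of row 2, then recover it. */
        int lost = 2;
        memset(A[lost], 0, sizeof A[lost]);
        for (int j = 0; j < N; j++) {
            double recovered = A[N][j];
            for (int i = 0; i < N; i++)
                if (i != lost) recovered -= A[i][j];
            A[lost][j] = recovered;
        }
        printf("recovered row %d: %.1f %.1f %.1f %.1f\n",
               lost, A[lost][0], A[lost][1], A[lost][2], A[lost][3]);
        return 0;
    }
    ```

    In the fault tolerant factorization, checksums of this kind are carried through the update steps so that recovery remains possible mid-factorization without writing checkpoints.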